An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention due to its advantages of fast extraction of remote sensing image information and flexible human–computer interaction. Traditional RSCTIR methods mainly focus on improving the performance of uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representation, which not only lack multi-modal interaction information, but also cause a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for image local features instead of a pre-trained object detector. By semantic alignment of visual and text features, the vision transformer module achieves the same performance as pre-trained object detectors for image local features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performances after retrieval processing. Experiments on the common RSICD and RSITMD datasets demonstrate that our EnVLF can obtain state-of-the-art retrieval performance.
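The abstract describes a two-stage idea: uni-modal encoders produce embeddings used for retrieval, and a multi-modal encoder then improves the ordering of the top-ranked results. The following is a minimal sketch of that general retrieve-then-re-rank pattern, not the authors' implementation. The function names `retrieve_then_rerank` and `fusion_score`, the use of cosine similarity for the first stage, and the embedding shapes are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def retrieve_then_rerank(text_emb, image_embs, fusion_score, k=5):
    """Rank all images for one text query, then re-score only the top-k.

    text_emb:     (d,) embedding from a text encoder (assumed given)
    image_embs:   (n, d) embeddings from an image encoder (assumed given)
    fusion_score: hypothetical callable (text_emb, image_emb) -> float,
                  standing in for a multi-modal matching score
    """
    # Stage 1: cosine similarity between uni-modal embeddings (an assumption).
    sims = l2_normalize(image_embs) @ l2_normalize(text_emb)
    order = np.argsort(-sims, kind="stable")

    # Stage 2: re-score the k best candidates with the fusion scorer.
    k = min(k, len(order))
    top = order[:k]
    fused = np.array([fusion_score(text_emb, image_embs[i]) for i in top])
    reranked = top[np.argsort(-fused, kind="stable")]

    # Candidates outside the top-k keep their stage-1 order.
    return np.concatenate([reranked, order[k:]])
```

Because only the first `k` candidates are passed to the more expensive scorer, the cost of the second stage does not grow with the size of the image collection. With `k=5`, re-ranking can only change positions within the first five results, which is where top-one and top-five metrics are measured.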
Similar resources
End-to-End Adaptive Framework for Multimedia Information Retrieval
The evolution of the web in the last decades has created the need for new requirements towards intelligent information retrieval capabilities and advanced user-oriented services. The current web integrates heterogeneous and distributed data such as XML database, relational database, P2P networks etc. leading to the coexistence of different data models and consequently different query languages....
Remote Sensing Image Fusion Based on Two-Stream Fusion Network
Remote sensing image fusion (or pan-sharpening) aims at generating high resolution multi-spectral (MS) image from inputs of a high spatial resolution single band panchromatic (PAN) image and a low spatial resolution multi-spectral image. In this paper, a deep convolutional neural network with two-stream inputs respectively for PAN and MS images is proposed for remote sensing image pan-sharpenin...
Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems
In query-by-semantic-example image retrieval, images are ranked by similarity of semantic descriptors. These descriptors are obtained by classifying each image with respect to a pre-defined vocabulary of semantic concepts. In this work, we consider the problem of improving the accuracy of semantic descriptors through cross-modal regularization, based on auxiliary text. A cross-modal regularizer...
Image Compression Based on Compressive Sensing: End-to-End Comparison with JPEG
We present an end-to-end image compression system based on compressive sensing. The presented system integrates the conventional scheme of compressive sampling and reconstruction with quantization and entropy coding. The compression performance, in terms of decoded image quality versus data rate, is shown to be comparable with JPEG and significantly better at the low rate range. We study the pa...
Cross-modal Retrieval by Text and Image Feature Biclustering
We describe our approach to the ImageCLEF-Photo 2007 task. The novelty of our method consists of biclustering image segments and annotation words. Given the query words, we may select the image segment clusters that have strongest cooccurrence with the corresponding word clusters. These image segment clusters act as the selected segments relevant to a query. We rank text hits by our own tf.idf ...
Journal
Journal title: Mathematics
Year: 2023
ISSN: 2227-7390
DOI: https://doi.org/10.3390/math11102279